
Exam 3 Review

Suppose that X_i = x = (x_1, ..., x_k)ᵀ is observed and that Y_i | X_i = x_i are independent Binomial(n_i, π(x_i)) for i = 1, ..., N, where

ˆπ(x) = exp(ˆα + ˆβᵀx) / [1 + exp(ˆα + ˆβᵀx)].

This is called the full model for logistic regression, and the (k + 1) parameters α, β_1, ..., β_k are estimated.

For the saturated model, the Y_i | X_i = x_i are independent Binomial(n_i, π_i) for i = 1, ..., N, where ˆπ_i = Y_i/n_i. This model estimates the N parameters π_i.

Let l_SAT(π_1, ..., π_N) be the likelihood function for the saturated model and let l_FULL(α, β) be the likelihood function for the full model. Let L_SAT = log l_SAT(ˆπ_1, ..., ˆπ_N) be the log likelihood function for the saturated model evaluated at the MLE (ˆπ_1, ..., ˆπ_N), and let L_FULL = log l_FULL(ˆα, ˆβ) be the log likelihood function for the full model evaluated at the MLE (ˆα, ˆβ). Then the deviance is

D = G² = -2(L_FULL - L_SAT).

The degrees of freedom for the deviance = df_FULL = N - k - 1, where N is the number of parameters for the saturated model and k + 1 is the number of parameters for the full model.

The saturated model is usually not very good for binary data (all n_i = 1) or if the n_i are small. The saturated model can be good if all of the n_i are large, or if π_i is very close to 0 or 1 whenever n_i is small.

If X ~ χ²_d, then E(X) = d and V(X) = 2d. An observed value x > d + 3√d is unusually large, and an observed value x < d - 3√d is unusually small. When the saturated model is good, a rule of thumb is that the logistic regression model is OK if G² ≤ N - k - 1 (or if G² ≤ N - k - 1 + 3√(N - k - 1)).

An estimated sufficient summary or ESS plot is a plot of w_i = ˆα + ˆβᵀx_i versus Y_i, with the logistic curve of fitted proportions ˆπ(w_i) = e^{w_i}/(1 + e^{w_i}) added to the plot along with a step function of observed proportions.

29) Suppose that w_i takes many values (e.g. the LR model has a continuous predictor) and that k + 1 << N. Know that the LR model is good if the step function tracks the logistic curve of fitted proportions in the ESS plot. Also know that you should check that the LR model is good before doing inference with the LR model. See HW6 4.
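In closed form the deviance for grouped binomial data is G² = 2 Σ [Y_i log(Y_i/(n_i ˆπ_i)) + (n_i - Y_i) log((n_i - Y_i)/(n_i - n_i ˆπ_i))], with 0 log 0 = 0. A minimal numerical sketch of this formula and the rule of thumb above, using made-up data and hypothetical fitted proportions (not from the course notes):

```python
import numpy as np

def binomial_deviance(y, n, pi_hat):
    """G^2 = -2(L_FULL - L_SAT) for grouped binomial data, with 0*log(0) = 0."""
    y, n, pi_hat = (np.asarray(a, float) for a in (y, n, pi_hat))
    t1 = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / (n * pi_hat)), 0.0)
    t2 = np.where(n - y > 0,
                  (n - y) * np.log(np.where(n - y > 0, n - y, 1.0) / (n * (1.0 - pi_hat))),
                  0.0)
    return 2.0 * float(np.sum(t1 + t2))

# made-up data: N = 4 covariate patterns, k = 1 predictor
y      = np.array([2, 5, 8, 9])               # successes Y_i
n      = np.array([10, 10, 10, 10])           # trials n_i
pi_hat = np.array([0.22, 0.48, 0.76, 0.92])   # hypothetical fitted pi-hat(x_i)

G2 = binomial_deviance(y, n, pi_hat)
N, k = len(y), 1
df = N - k - 1
ok = G2 <= df + 3 * np.sqrt(df)               # rule-of-thumb check from above
```

In practice ˆπ(x_i) would come from the fitted logistic regression; the numbers here only illustrate the arithmetic.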

Response = Y    Terms = (X_1, ..., X_k)

Sequential Analysis of Deviance

Predictor   Total df            Total Deviance   Change df   Change Deviance
Ones        N-1 = df_o          G²_o
X_1         N-2                                  1
X_2         N-3                                  1
...
X_k         N-k-1 = df_FULL     G²_FULL          1

Data set = cbrain, Name of Fit = B1
Response = sex    Terms = (cephalic size log[size])

Sequential Analysis of Deviance

Predictor   Total df   Total Deviance   Change df   Change Deviance
Ones
cephalic
size
log[size]

Know how to use the above output for the following test. Assume that the ESS plot has been made and that the observed proportions track the logistic curve. If the logistic curve looks like a line with small positive slope, then the predictors may not be useful. The following test asks whether ˆπ(x_i) from the logistic regression should be used to estimate P(Y_i = 1 | x_i), or whether none of the predictors should be used, so that P(Y_i = 1) ≡ π for all i = 1, ..., N, estimated by ˆπ = Σ_{i=1}^N Y_i / Σ_{i=1}^N n_i.

30) The 4 step (log likelihood) deviance test:
i) H_o: β_1 = ... = β_k = 0 versus H_A: not H_o.
ii) Test statistic G²(o|F) = G²_o - G²_FULL.
iii) The p-value = P(W > G²(o|F)) where W ~ χ²_k has a chi-square distribution with k degrees of freedom. Note that k = (k + 1) - 1 = df_o - df_FULL = (N - 1) - (N - k - 1).
iv) Reject H_o if the p-value < δ and conclude that there is an LR relationship between Y and the predictors X_1, ..., X_k. If the p-value ≥ δ, fail to reject H_o and conclude that there is not an LR relationship between Y and the predictors X_1, ..., X_k. See HW6 6a.
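Once G²_o and G²_FULL are read off the Sequential Analysis of Deviance table, steps ii) through iv) are one line each. A sketch with hypothetical deviance values (assuming SciPy is available):

```python
from scipy.stats import chi2

def deviance_test(G2_o, G2_full, df_o, df_full):
    """4 step deviance test of Ho: beta_1 = ... = beta_k = 0."""
    stat = G2_o - G2_full            # ii) G^2(o|F)
    df = df_o - df_full              # = k
    pvalue = chi2.sf(stat, df)       # iii) P(W > G^2(o|F)), W ~ chi^2_k
    return stat, df, pvalue

# hypothetical deviances read off a Sequential Analysis of Deviance table:
# the Ones row gives G2_o with df_o = N - 1; the last row gives G2_full with df_full = N - k - 1
stat, df, p = deviance_test(G2_o=260.98, G2_full=160.0, df_o=199, df_full=196)
reject = p < 0.05                    # iv) reject Ho at delta = 0.05
```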

After obtaining an acceptable full model where

logit(π(x_i)) = α + β_1 x_i1 + ... + β_k x_ik = α + βᵀx_i,

try to obtain a reduced model Y_i | X_Ri = x_Ri independent Binomial(n_i, π(x_Ri)) where

logit(π(x_Ri)) = α_R + β_R1 x_Ri1 + ... + β_Rm x_Rim = α_R + β_Rᵀx_Ri

and {x_Ri1, ..., x_Rim} ⊂ {x_1, ..., x_k}. Let x_R,m+1, ..., x_Rk denote the k - m predictors that are in the full model but not in the reduced model. We want to test H_o: β_R,m+1 = ... = β_Rk = 0. For notational ease, we will often assume that the predictors have been sorted and partitioned so that x_i = x_Ri for i = 1, ..., k. Then the reduced model uses predictors x_1, ..., x_m, and we test H_o: β_m+1 = ... = β_k = 0. In practice, however, this sorting is usually not done.

Assume that the ESS plot looks good. Then we want to test H_o: the reduced model can be used instead of the full model, versus H_A: the full model is (significantly) better than the reduced model. Fit the full model and the reduced model to get the deviances G²_FULL and G²_RED.

31) The 4 step change in deviance test:
i) H_o: the reduced model is good. H_A: use the full model.
ii) Test statistic G²(R|F) = G²_RED - G²_FULL.
iii) The p-value = P(W > G²(R|F)) where W ~ χ²_{k-m} has a chi-square distribution with k - m degrees of freedom. Here k is the number of predictors in the full model while m is the number of predictors in the reduced model. Also notice that k - m = (k + 1) - (m + 1) = df_RED - df_FULL = (N - m - 1) - (N - k - 1).
iv) Reject H_o if the p-value < δ and conclude that the full model is (significantly) better than the reduced model. If the p-value ≥ δ, fail to reject H_o and conclude that the reduced model is good. See HW6 6b.

32) If the reduced model leaves out a single variable X_i, then the change in deviance test becomes H_o: β_i = 0 versus H_A: β_i ≠ 0. This likelihood ratio test is a competitor of the Wald test (see 28)). The likelihood ratio test is usually better than the Wald test if the sample size N is not large, but the Wald test is currently easier for software to produce. For large N the test statistics from the two tests tend to be very similar (asymptotically equivalent tests).

33) If the reduced model is good, then the plotted points in the EE plot of ˆα_R + ˆβ_Rᵀx_Ri versus ˆα + ˆβᵀx_i should cluster tightly about the identity line with unit slope and zero intercept. Know how to use the following output to test the reduced model versus the full model.
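The change in deviance test in 31) has the same shape as the deviance test in 30), only with df = k - m. A sketch with hypothetical deviances (assuming SciPy is available):

```python
from scipy.stats import chi2

def change_in_deviance_test(G2_red, G2_full, k, m, delta=0.05):
    """31) Ho: the reduced model is good, vs HA: use the full model."""
    stat = G2_red - G2_full          # ii) G^2(R|F)
    df = k - m                       # = df_RED - df_FULL
    pvalue = chi2.sf(stat, df)       # iii) P(W > G^2(R|F)), W ~ chi^2_{k-m}
    return stat, df, pvalue, pvalue < delta

# hypothetical: full model with k = 3 predictors, reduced model with m = 1
stat, df, p, use_full = change_in_deviance_test(G2_red=9.1, G2_full=5.8, k=3, m=1)
```

Here `use_full` is False, so this hypothetical reduced model would be judged good at δ = 0.05.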

Response = Y    Terms = (X_1, ..., X_k)   (Full Model)

Label      Estimate   Std Error   Est/SE                    p-value
Constant   ˆα         se(ˆα)      z_o,0                     for Ho: α = 0
x_1        ˆβ_1       se(ˆβ_1)    z_o,1 = ˆβ_1/se(ˆβ_1)     for Ho: β_1 = 0
...
x_k        ˆβ_k       se(ˆβ_k)    z_o,k = ˆβ_k/se(ˆβ_k)     for Ho: β_k = 0

Degrees of freedom: N - k - 1 = df_FULL
Deviance: D = G²_FULL

Response = Y    Terms = (X_1, ..., X_m)   (Reduced Model)

Label      Estimate   Std Error   Est/SE                    p-value
Constant   ˆα         se(ˆα)      z_o,0                     for Ho: α = 0
x_1        ˆβ_1       se(ˆβ_1)    z_o,1 = ˆβ_1/se(ˆβ_1)     for Ho: β_1 = 0
...
x_m        ˆβ_m       se(ˆβ_m)    z_o,m = ˆβ_m/se(ˆβ_m)     for Ho: β_m = 0

Degrees of freedom: N - m - 1 = df_RED
Deviance: D = G²_RED

Data set = Banknotes, Name of Fit = B1 (Full Model)
Response = Status    Terms = (Diagonal Bottom Top)
Coefficient Estimates
Label      Estimate   Std Error   Est/SE   p-value
Constant
Diagonal
Bottom
Top
Degrees of freedom: 196
Deviance: 0.009

Data set = Banknotes, Name of Fit = B2 (Reduced Model)
Response = Status    Terms = (Diagonal)
Coefficient Estimates
Label      Estimate   Std Error   Est/SE   p-value
Constant
Diagonal
Degrees of freedom: 198
Deviance:
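The Est/SE column in these tables is the Wald z statistic, and the p-value column is the two-sided normal tail probability. A sketch with a hypothetical estimate and standard error (assuming SciPy is available):

```python
from scipy.stats import norm

def wald_z(beta_hat, se):
    """Est/SE column: z = beta-hat/se, with two-sided p-value for Ho: beta_i = 0."""
    z = beta_hat / se
    return z, 2.0 * norm.sf(abs(z))

# hypothetical estimate and standard error
z, p = wald_z(beta_hat=1.96, se=1.0)   # p is about 0.05 at z = 1.96
```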

34) Let π(x) = P(success | x) = 1 - P(failure | x), where a success is what is counted and a failure is what is not counted (so if the Y_i are binary, π(x) = P(Y_i = 1 | x)). Then the estimated odds of success is

ˆΩ(x) = ˆπ(x) / (1 - ˆπ(x)).

35) In logistic regression, increasing a predictor x_i by 1 unit (while holding all other predictors fixed) multiplies the estimated odds of success by a factor of exp(ˆβ_i).

36) Suppose that the binary response variable Y is conditionally independent of x given a single linear combination βᵀx of the predictors, written Y ⊥ x | βᵀx. If the LR model holds and if the first SIR predictor ˆβ_SIR1ᵀx and ˆα + ˆβᵀx are highly correlated, then (to a good approximation) Y ⊥ x | ˆα + ˆβᵀx and Y ⊥ x | ˆβ_SIR1ᵀx. To make a binary response plot for logistic regression, fit SIR and the LR model and assume that the above conditions hold. Place the first SIR predictor on the horizontal axis and the 2nd SIR predictor ˆβ_SIR2ᵀx on the vertical axis. If Y = 0 use symbol 0 and if Y = 1 use symbol X. If the LR model is good, then consider the symbol density of X's and 0's in a narrow vertical slice. This symbol density should be approximately constant (up to binomial variation) from the bottom to the top of the slice (hence the X's and 0's should be mixed in the slice). The symbol density may change greatly as the slice is moved from the left to the right of the plot, e.g. from 0% to 100%. If there are slices where the symbol density is not constant from top to bottom, then the LR model may not be good (e.g. a more complicated model may be needed).

37) Given a predictor x, sometimes x is not used by itself in the full LR model. Suppose that Y is binary. Then to decide what functions of x should be in the model, look at the conditional distribution of x | Y = i for i = 0, 1. These rules are used if x is an indicator variable or if x is a continuous variable.

distribution of x | Y = i                  functions of x to include in the full LR model
x | Y = i is an indicator                  x
x | Y = i ~ N(µ_i, σ²)                     x
x | Y = i ~ N(µ_i, σ_i²)                   x and x²
x | Y = i has a skewed distribution        x and log(x)
x | Y = i has support on (0,1)             log(x) and log(1 - x)

38) If w is a nominal variable with J levels, use J - 1 (indicator or) dummy variables x_1,w, ..., x_J-1,w in the full model.

39) An interaction is a product of two or more predictor variables. Interactions are difficult to interpret. Often interactions are included in the full model, and the reduced model without any interactions is tested. The investigator is hoping that the interactions are not needed.
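Points 34) and 35) are short enough to verify by hand; a sketch with a hypothetical fitted proportion and coefficient:

```python
import math

def estimated_odds(pi_hat):
    """34) Omega-hat(x) = pi-hat(x) / (1 - pi-hat(x))."""
    return pi_hat / (1.0 - pi_hat)

# 35) increasing x_i by 1 unit multiplies the odds by exp(beta_i-hat)
beta_i_hat = 0.4                       # hypothetical coefficient
odds_multiplier = math.exp(beta_i_hat)

odds = estimated_odds(0.8)             # pi-hat = 0.8 gives odds of 4 to 1
```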

40) A scatterplot of x vs Y is used to visualize the conditional distribution of Y | x. A scatterplot matrix is an array of scatterplots; it is used to examine the marginal relationships of the predictors and response. Place Y on the top or bottom of the scatterplot matrix, and also mark the plotted points by a 0 if Y = 0 and by an X if Y = 1. Variables with outliers, missing values or strong nonlinearities may be so bad that they should not be included in the full model.

41) Suppose that all values of the variable x are positive. The log rule says add log(x) to the full model if max(x_i)/min(x_i) > 10.

42) To make a full model, use points 37), 38), 40) and 41), and sometimes 39). The number of predictors in the full model should be much smaller than the number of data cases N. Make an ESS plot to check that the full model is good.

43) Variable selection is closely related to the change in deviance test for a reduced model. You are seeking a subset I of the variables to keep in the model. The AIC(I) statistic is used as an aid in backward elimination and forward selection. The full model and the model with the smallest AIC are always of interest. Create a full model; the full model has a deviance at least as small as that of any submodel.

44) Backward elimination starts with the full model with k variables, and the predictor that optimizes some criterion is deleted. Then there are k - 1 variables left, and the predictor that optimizes some criterion is deleted. This process continues for models with k - 2, k - 3, ..., 3 and 2 predictors. Forward selection starts with the model with 0 variables, and the predictor that optimizes some criterion is added. Then there is 1 variable in the model, and the predictor that optimizes some criterion is added. This process continues for models with 2, 3, ..., k - 2 and k - 1 predictors. Both forward selection and backward elimination result in a sequence of k models {x_1}, {x_1, x_2}, ..., {x_1, x_2, ..., x_k-1}, {x_1, x_2, ..., x_k} = full model.

45) For logistic regression, suppose that the Y_i are binary for i = 1, ..., N. Let N_1 = Σ Y_i = the number of 1's and N_0 = N - N_1 = the number of 0's. Rule of thumb: the final submodel should have m predictor variables where m is small, with m ≤ min(N_1, N_0)/10.

46) Know how to find good models from output. A good submodel I will use a small number of predictors, have a good ESS plot, and have a good EE plot. A good LR submodel I should have a deviance G²(I) close to that of the full model, in that the change in deviance test 31) would not be rejected. Also the submodel should have a value of AIC(I) close to that of the examined model that has the minimum AIC value. The LR output for model I should not have many variables with large Wald test p-values.

47) Heuristically, backward elimination tries to delete the variable that will increase the deviance the least. An increase in deviance greater than 4 (if the predictor has 1 degree of freedom) may be troubling, in that a good predictor may have been deleted. In practice, the backward elimination program may delete the variable such that the submodel I with j predictors has 1) the smallest AIC(I), 2) the smallest deviance G²(I), or 3) the biggest p-value (preferably from a change in deviance test but possibly from a Wald test) in the test H_o: β_i = 0 versus H_A: β_i ≠ 0, where the current model with j + 1

variables is treated as the full model.

48) Heuristically, forward selection tries to add the variable that will decrease the deviance the most. A decrease in deviance less than 4 (if the predictor has 1 degree of freedom) may be troubling, in that a bad predictor may have been added. In practice, the forward selection program may add the variable such that the submodel I with j predictors has 1) the smallest AIC(I), 2) the smallest deviance G²(I), or 3) the smallest p-value (preferably from a change in deviance test but possibly from a Wald test) in the test H_o: β_i = 0 versus H_A: β_i ≠ 0, where the current model with j terms plus the predictor x_i is treated as the full model (for all variables x_i not yet in the model).

49) For logistic regression, let N_1 = number of ones and N_0 = N - N_1 = number of zeroes. A rough rule of thumb is that the full model should use no more than min(N_0, N_1)/5 predictors and the final submodel should use no more than min(N_0, N_1)/10 predictors.

50) For loglinear regression, a rough rule of thumb is that the full model should use no more than N/5 predictors and the final submodel should use no more than N/10 predictors.

51) Variable selection is pretty much the same for logistic regression and loglinear regression. Suppose that the full model is good and is stored in M1. Let M2, M3, M4, and M5 be candidate submodels found after forward selection, backward elimination, etc. Make a scatterplot matrix of M2, M3, M4, M5 and M1. Good candidates should have estimated linear predictors that are highly correlated with the full model estimated linear predictor (the correlation should be at least 0.9 and preferably greater than 0.95). For binary logistic regression, mark the symbols using the response variable Y. See HW7 1, HW8 1, HW9 1 and HW

52) The final submodel I should have few predictors, few variables with large Wald p-values (0.01 to 0.05 is borderline), a good ESS plot and an EE plot that clusters tightly about the identity line. Do not use more predictors than the min AIC model I_min, and want AIC(I) ≤ AIC(I_min) + 7. For the change in deviance test, want p-value ≥ 0.01 for variable selection (instead of δ = 0.05). If a factor has J - 1 dummy variables, either keep all J - 1 dummy variables or delete all J - 1 dummy variables; do not delete only some of the dummy variables.

53) Know that when there is perfect classification in the binary logistic regression model, the LR MLE does not exist and the output is suspect. However, often the full model deviance is close to 0 and the deviance test correctly rejects H_o.

54) Suppose that X_i = x = (x_1, ..., x_k)ᵀ is observed and that Y_i | X_i = x_i are independent Poisson(µ(x_i)) for i = 1, ..., N, where ˆµ(x) = exp(ˆα + ˆβᵀx). This is called the full model for loglinear regression, and the (k + 1) parameters α, β_1, ..., β_k are estimated. Know how to predict ˆµ(x). Also Ŷ = ˆµ(x). See HW9 2, Q8.
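The stepwise searches in 44), 47) and 48) can be sketched abstractly. A toy backward elimination under the assumption that AIC(I) is already known for each candidate subset (the subset names and AIC numbers below are made up):

```python
# hypothetical AIC(I) values for the candidate subsets I (made-up numbers)
AIC = {
    frozenset({"x1", "x2", "x3"}): 210.0,
    frozenset({"x1", "x2"}): 208.5,
    frozenset({"x1", "x3"}): 215.0,
    frozenset({"x2", "x3"}): 220.0,
    frozenset({"x1"}): 209.0,
    frozenset({"x2"}): 230.0,
}

def backward_eliminate(full_model):
    """44)/47): repeatedly drop the predictor whose removal gives the smallest AIC."""
    current = frozenset(full_model)
    path = [current]
    while len(current) > 1:
        candidates = [current - {v} for v in current]
        current = min(candidates, key=lambda s: AIC[s])
        path.append(current)
    return path

path = backward_eliminate({"x1", "x2", "x3"})
best = min(path, key=lambda s: AIC[s])    # smallest AIC along the search path
```

In real software each AIC(I) comes from fitting submodel I; the search itself only ever compares the current model with its one-variable-smaller candidates, which is why it visits k models rather than all 2^k subsets.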

For the saturated model, the Y_i | X_i = x_i are independent Poisson(µ_i) for i = 1, ..., N, where ˆµ_i = Y_i. This model estimates the N parameters µ_i. The saturated model is usually bad; an exception is when all N of the Y_i are large. The comments on the deviance in the middle of p. 1 still hold.

55) An estimated sufficient summary or EY plot is a plot of w_i = ˆα + ˆβᵀx_i versus Y_i, with the exponential curve of estimated means ˆµ(w_i) = e^{w_i} added to the plot along with a lowess curve.

56) Suppose that w_i takes many values (e.g. the LLR model has a continuous predictor) and that k + 1 << N. Know that the LLR model is good if the lowess curve tracks the exponential curve of estimated means in the EY plot. Also know that you should check that the LLR model is good before doing inference with the LLR model. See HW9 2.

57) Know how to perform the 4 step deviance test. This test is almost exactly the same as that in 30), but replace LR by LLR in the conclusion. The output looks almost like that shown on p. 2. See HW9 2, Q8. The deviance test for LLR asks whether ˆµ(x_i) from LLR should be used to estimate µ(x_i), or whether none of the predictors should be used, so ˆµ = Ȳ = Σ_{i=1}^N Y_i / N.

58) Know how to perform the 4 step Wald test. This test is the same as 28) except replace LR by LLR.

59) Know that a (Wald) 95% CI for β_i is ˆβ_i ± 1.96 se(ˆβ_i).

60) Know how to perform the 4 step change in deviance test. The output is almost the same as that on p. 4 and the test is exactly the same as that given in 31). For H_o, the parameters set to 0 are those that are in the full model but not the reduced model.

61) Know what a lurking variable is.

62) Know the difference between an observational study and an experiment. A clinical trial is a randomized controlled experiment performed on humans.

Exam 3 is on Wednesday, April 19 and covers Agresti material including points 23) through 28) on the Exam 2 review. 7 pages of notes. You should know how to use a random number table to draw a simple random sample in order to divide units into 2 groups. In Agresti, we have covered ch. 1, 2.1, 2.2, 2.3, 2.4, 5.1, 5.2, 5.3, 5.4, and 5.5, but have skipped subsections 2.4.5, 2.4.6, 2.4.7, 5.3.3, 5.3.4, and 5.5.6.
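For loglinear regression, the deviance against the saturated model (points 54 and 57) has the closed form G² = 2 Σ [Y_i log(Y_i/ˆµ_i) - (Y_i - ˆµ_i)], with 0 log 0 = 0. A sketch with made-up counts and hypothetical fitted means (not from the course notes):

```python
import numpy as np

def poisson_deviance(y, mu_hat):
    """G^2 = 2 * sum[ y*log(y/mu-hat) - (y - mu-hat) ], with 0*log(0) = 0."""
    y, mu_hat = np.asarray(y, float), np.asarray(mu_hat, float)
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu_hat), 0.0)
    return 2.0 * float(np.sum(term - (y - mu_hat)))

# hypothetical counts and fitted means mu-hat(x_i) = exp(alpha-hat + beta-hat^T x_i)
y  = np.array([3, 7, 12, 20])
mu = np.array([3.5, 6.5, 13.0, 19.0])
G2 = poisson_deviance(y, mu)

# if none of the predictors is used, mu-hat = Ybar for every case (see 57))
G2_null = poisson_deviance(y, np.full_like(mu, y.mean()))
```

The gap G2_null - G2 is the statistic for the LLR deviance test of 57); here it is large, as expected when the predictors carry information about the counts.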



More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Simple logistic regression

Simple logistic regression Simple logistic regression Biometry 755 Spring 2009 Simple logistic regression p. 1/47 Model assumptions 1. The observed data are independent realizations of a binary response variable Y that follows a

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Yan Lu Jan, 2018, week 3 1 / 67 Hypothesis tests Likelihood ratio tests Wald tests Score tests 2 / 67 Generalized Likelihood ratio tests Let Y = (Y 1,

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Generalized Linear Models: An Introduction

Generalized Linear Models: An Introduction Applied Statistics With R Generalized Linear Models: An Introduction John Fox WU Wien May/June 2006 2006 by John Fox Generalized Linear Models: An Introduction 1 A synthesis due to Nelder and Wedderburn,

More information

Solutions for Examination Categorical Data Analysis, March 21, 2013

Solutions for Examination Categorical Data Analysis, March 21, 2013 STOCKHOLMS UNIVERSITET MATEMATISKA INSTITUTIONEN Avd. Matematisk statistik, Frank Miller MT 5006 LÖSNINGAR 21 mars 2013 Solutions for Examination Categorical Data Analysis, March 21, 2013 Problem 1 a.

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

Analysing categorical data using logit models

Analysing categorical data using logit models Analysing categorical data using logit models Graeme Hutcheson, University of Manchester The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/manchester

More information

1/15. Over or under dispersion Problem

1/15. Over or under dispersion Problem 1/15 Over or under dispersion Problem 2/15 Example 1: dogs and owners data set In the dogs and owners example, we had some concerns about the dependence among the measurements from each individual. Let

More information

STATISTICS 110/201 PRACTICE FINAL EXAM

STATISTICS 110/201 PRACTICE FINAL EXAM STATISTICS 110/201 PRACTICE FINAL EXAM Questions 1 to 5: There is a downloadable Stata package that produces sequential sums of squares for regression. In other words, the SS is built up as each variable

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Multiple linear regression S6

Multiple linear regression S6 Basic medical statistics for clinical and experimental research Multiple linear regression S6 Katarzyna Jóźwiak k.jozwiak@nki.nl November 15, 2017 1/42 Introduction Two main motivations for doing multiple

More information

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20 Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018 UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Thursday, August 30, 2018 Work all problems. 60 points are needed to pass at the Masters Level and 75

More information

BMI 541/699 Lecture 22

BMI 541/699 Lecture 22 BMI 541/699 Lecture 22 Where we are: 1. Introduction and Experimental Design 2. Exploratory Data Analysis 3. Probability 4. T-based methods for continous variables 5. Power and sample size for t-based

More information

Solutions to the Spring 2018 CAS Exam MAS-1

Solutions to the Spring 2018 CAS Exam MAS-1 !! Solutions to the Spring 2018 CAS Exam MAS-1 (Incorporating the Final CAS Answer Key) There were 45 questions in total, of equal value, on this 4 hour exam. There was a 15 minute reading period in addition

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

Poisson regression 1/15

Poisson regression 1/15 Poisson regression 1/15 2/15 Counts data Examples of counts data: Number of hospitalizations over a period of time Number of passengers in a bus station Blood cells number in a blood sample Number of typos

More information

THE PEARSON CORRELATION COEFFICIENT

THE PEARSON CORRELATION COEFFICIENT CORRELATION Two variables are said to have a relation if knowing the value of one variable gives you information about the likely value of the second variable this is known as a bivariate relation There

More information

Figure 36: Respiratory infection versus time for the first 49 children.

Figure 36: Respiratory infection versus time for the first 49 children. y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects

More information

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression Logistic Regression Usual linear regression (repetition) y i = b 0 + b 1 x 1i + b 2 x 2i + e i, e i N(0,σ 2 ) or: y i N(b 0 + b 1 x 1i + b 2 x 2i,σ 2 ) Example (DGA, p. 336): E(PEmax) = 47.355 + 1.024

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

11. Generalized Linear Models: An Introduction

11. Generalized Linear Models: An Introduction Sociology 740 John Fox Lecture Notes 11. Generalized Linear Models: An Introduction Copyright 2014 by John Fox Generalized Linear Models: An Introduction 1 1. Introduction I A synthesis due to Nelder and

More information

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression 22s:52 Applied Linear Regression Ch. 4 (sec. and Ch. 5 (sec. & 4: Logistic Regression Logistic Regression When the response variable is a binary variable, such as 0 or live or die fail or succeed then

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

STAT 525 Fall Final exam. Tuesday December 14, 2010

STAT 525 Fall Final exam. Tuesday December 14, 2010 STAT 525 Fall 2010 Final exam Tuesday December 14, 2010 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Communications of the Korean Statistical Society 2009, Vol 16, No 4, 697 705 Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Kwang Mo Jeong a, Hyun Yung Lee 1, a a Department

More information

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)

More information

Unit 11: Multiple Linear Regression

Unit 11: Multiple Linear Regression Unit 11: Multiple Linear Regression Statistics 571: Statistical Methods Ramón V. León 7/13/2004 Unit 11 - Stat 571 - Ramón V. León 1 Main Application of Multiple Regression Isolating the effect of a variable

More information

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00 / 5 Administration Homework on web page, due Feb NSERC summer undergraduate award applications due Feb 5 Some helpful books STA 44/04 Jan 6, 00... administration / 5 STA 44/04 Jan 6,

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection Model Selection in GLMs Last class: estimability/identifiability, analysis of deviance, standard errors & confidence intervals (should be able to implement frequentist GLM analyses!) Today: standard frequentist

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3 STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae

More information

REVISED PAGE PROOFS. Logistic Regression. Basic Ideas. Fundamental Data Analysis. bsa350

REVISED PAGE PROOFS. Logistic Regression. Basic Ideas. Fundamental Data Analysis. bsa350 bsa347 Logistic Regression Logistic regression is a method for predicting the outcomes of either-or trials. Either-or trials occur frequently in research. A person responds appropriately to a drug or does

More information

Unit 6 - Introduction to linear regression

Unit 6 - Introduction to linear regression Unit 6 - Introduction to linear regression Suggested reading: OpenIntro Statistics, Chapter 7 Suggested exercises: Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25,

More information

Three-Way Tables (continued):

Three-Way Tables (continued): STAT5602 Categorical Data Analysis Mills 2015 page 110 Three-Way Tables (continued) Now let us look back over the br preference example. We have fitted the following loglinear models 1.MODELX,Y,Z logm

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Regression modeling for categorical data. Part II : Model selection and prediction

Regression modeling for categorical data. Part II : Model selection and prediction Regression modeling for categorical data Part II : Model selection and prediction David Causeur Agrocampus Ouest IRMAR CNRS UMR 6625 http://math.agrocampus-ouest.fr/infogluedeliverlive/membres/david.causeur

More information

Section Poisson Regression

Section Poisson Regression Section 14.13 Poisson Regression Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 26 Poisson regression Regular regression data {(x i, Y i )} n i=1,

More information

Answer Key for STAT 200B HW No. 7

Answer Key for STAT 200B HW No. 7 Answer Key for STAT 200B HW No. 7 May 5, 2007 Problem 2.2 p. 649 Assuming binomial 2-sample model ˆπ =.75, ˆπ 2 =.6. a ˆτ = ˆπ 2 ˆπ =.5. From Ex. 2.5a on page 644: ˆπ ˆπ + ˆπ 2 ˆπ 2.75.25.6.4 = + =.087;

More information

Regression Diagnostics for Survey Data

Regression Diagnostics for Survey Data Regression Diagnostics for Survey Data Richard Valliant Joint Program in Survey Methodology, University of Maryland and University of Michigan USA Jianzhu Li (Westat), Dan Liao (JPSM) 1 Introduction Topics

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Multiple Logistic Regression for Dichotomous Response Variables

Multiple Logistic Regression for Dichotomous Response Variables Multiple Logistic Regression for Dichotomous Response Variables Edps/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2018 Outline

More information

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Exploring Data: Distributions Look for overall pattern (shape, center, spread) and deviations (outliers). Mean (use a calculator): x = x 1 + x

More information

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B Simple Linear Regression 35 Problems 1 Consider a set of data (x i, y i ), i =1, 2,,n, and the following two regression models: y i = β 0 + β 1 x i + ε, (i =1, 2,,n), Model A y i = γ 0 + γ 1 x i + γ 2

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Scatter plot of data from the study. Linear Regression

Scatter plot of data from the study. Linear Regression 1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25

More information

Ch 6: Multicategory Logit Models

Ch 6: Multicategory Logit Models 293 Ch 6: Multicategory Logit Models Y has J categories, J>2. Extensions of logistic regression for nominal and ordinal Y assume a multinomial distribution for Y. In R, we will fit these models using the

More information

Chapter 4: Generalized Linear Models-I

Chapter 4: Generalized Linear Models-I : Generalized Linear Models-I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information